Write incremental results after each task completion by juppytt · Pull Request #93 · pinchbench/skill

juppytt · 2026-04-02T07:24:27Z

Summary

- Write partial result JSON after each task finishes grading
- External tools can poll the result file to show live progress
- Partial results include in_progress: true, completed_tasks, and total_tasks fields
- Final write at the end overwrites without these fields

This enables dashboards and monitoring tools to display per-task progress while a benchmark run is still in progress.
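A minimal sketch of the write path described in the summary, not the PR's actual code. The function name `write_results`, the top-level `tasks` key, and the write-to-temp-then-rename step are assumptions for illustration; only the `in_progress`, `completed_tasks`, and `total_tasks` fields come from the description above.

```python
import json
import os
import tempfile


def write_results(path, results, total_tasks=None, in_progress=False):
    """Write the result JSON; add progress fields only for partial writes."""
    payload = {"tasks": results}  # "tasks" is a hypothetical key name
    if in_progress:
        payload["in_progress"] = True
        payload["completed_tasks"] = len(results)
        payload["total_tasks"] = total_tasks
    # Assumption: write to a temp file and rename it into place, so a tool
    # polling the file never reads a half-written document.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_path, path)
    return payload
```

Calling it with `in_progress=True` after each graded task and once more without that flag at the end of the run would reproduce the behavior in the summary: the final file carries no progress fields.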

Session transcripts were deleted between tasks by cleanup_agent_sessions, making post-run debugging impossible. Now transcripts are copied to results/{run_id}_transcripts/{task_id}.jsonl before cleanup.

Also fixes pre-existing duplicate _remove_readonly function definition that caused a SyntaxError on import.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
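The copy-before-cleanup step could look roughly like the sketch below. The helper name `archive_transcript` and its parameters are hypothetical; only the destination layout `results/{run_id}_transcripts/{task_id}.jsonl` is taken from the commit message, and how the source transcript path is discovered is not covered here.

```python
import shutil
from pathlib import Path


def archive_transcript(transcript_path, results_dir, run_id, task_id):
    """Copy one session transcript to {results_dir}/{run_id}_transcripts/{task_id}.jsonl."""
    source = Path(transcript_path)
    if not source.is_file():
        return None  # nothing to archive for this task
    dest_dir = Path(results_dir) / f"{run_id}_transcripts"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{task_id}.jsonl"
    shutil.copy2(source, dest)  # copy2 also preserves timestamps
    return dest
```

The caller would invoke this for each task before cleanup_agent_sessions runs, so the archived copy survives the deletion.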

When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which caused all llm_judge tasks to score 0.

Supported model prefixes:
- openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
- anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY)
- openai/* -> OpenAI chat completions (OPENAI_API_KEY)
- claude -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).

Also fixes pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
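The prefix routing in this commit message can be sketched as a small lookup. The function name `resolve_judge_backend` and the backend labels it returns are hypothetical, and raising `ValueError` for an unrecognized value is an assumption, since the commit message does not say how unsupported models are handled. The prefixes and environment variable names are the ones listed above.

```python
def resolve_judge_backend(judge):
    """Map a --judge value to (backend_label, api_key_env_var)."""
    if not judge:
        # No --judge: behavior is unchanged, grade via an OpenClaw agent session.
        return ("openclaw_session", None)
    if judge == "claude":
        # Headless Claude CLI (claude -p); no API key variable is listed for it.
        return ("claude_cli", None)
    prefixes = {
        "openrouter/": ("openrouter", "OPENROUTER_API_KEY"),
        "anthropic/": ("anthropic_messages", "ANTHROPIC_API_KEY"),
        "openai/": ("openai_chat_completions", "OPENAI_API_KEY"),
    }
    for prefix, target in prefixes.items():
        if judge.startswith(prefix):
            return target
    # Assumption: reject anything else rather than silently falling back.
    raise ValueError(f"Unsupported judge model: {judge!r}")
```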

The function was defined twice on consecutive lines with the second definition shadowing the first. Also removed an extra bare func(path) call outside the try/except block.

Update the result JSON after every task finishes grading so external tools can poll progress while the benchmark is still running. The partial result includes in_progress=true, completed_tasks, and total_tasks fields. The final write at the end overwrites without these fields.
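From the consumer side, an external tool could read progress along these lines. The function name `read_progress` and its return shape are hypothetical; the field names come from the commit message. Treating a missing or unparsable file as "no data yet" is an assumption about how a poller should behave if it catches the file between writes.

```python
import json


def read_progress(path):
    """Return (completed, total, in_progress) from a result file, or None."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # not written yet, or read while being rewritten
    if data.get("in_progress"):
        return (data["completed_tasks"], data["total_tasks"], True)
    # The final write omits the progress fields, so the run is complete.
    return (None, None, False)
```

A dashboard would call this on a timer and stop polling once the third element is False.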

juppytt and others added 6 commits April 1, 2026 20:24

Merge branch 'feat/transcript-archive'

0b34ba5

Fix duplicate _remove_readonly definition causing IndentationError

fde4b00

Merge branch 'fix/duplicate-remove-readonly'

91efe4c

juppytt wants to merge 6 commits into pinchbench:main from juppytt:feat/incremental-results
